
    Robust distance correlation for variable screening

    High-dimensional data are common in modern statistical applications, and variable selection methods play an indispensable role in identifying the critical features for scientific discovery. Traditional best subset selection methods are computationally intractable when the number of features is large, while regularization methods such as Lasso, SCAD and their variants perform poorly on ultrahigh-dimensional data due to low computational efficiency and unstable algorithms. Sure screening methods have become popular alternatives: they first rapidly reduce the dimension using simple measures such as marginal correlation and then apply a regularization method. A number of screening methods for different models or problems have been developed; however, none of them targets data with heavy tails, another important characteristic of modern big data. In this paper, we propose a robust distance correlation ("RDC") based sure screening method for ultrahigh-dimensional regression with heavy-tailed data. The proposed method retains the good properties of the original model-free distance correlation based screening, while additionally estimating the distance correlation robustly when the data are heavy-tailed, which improves model selection performance in screening. We conducted extensive simulations under different scenarios of heavy-tailedness to demonstrate the advantage of our procedure over existing model-based and model-free screening procedures in feature selection and prediction. We also applied the method to the high-dimensional, heavy-tailed RNA sequencing (RNA-seq) data of The Cancer Genome Atlas (TCGA) pancreatic cancer cohort, where RDC outperformed the other methods in prioritizing the most essential and biologically meaningful genes.
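    The abstract gives no formulas, but marginal screening by distance correlation can be sketched in a few lines. The rank transform below is one plausible robustification for heavy tails and is an assumption on our part; the paper's actual RDC estimator is not described in the abstract, and the function names are illustrative.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays (Szekely et al.)."""
    a = np.abs(x[:, None] - x[None, :])                  # pairwise distances of x
    b = np.abs(y[:, None] - y[None, :])                  # pairwise distances of y
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def rdc_screen(X, y, d):
    """Keep the d features with the largest rank-based distance correlation.

    The rank transform is one plausible robustification for heavy tails;
    the paper's actual RDC estimator may differ.
    """
    ranks = lambda v: v.argsort().argsort().astype(float)
    ry = ranks(y)
    scores = np.array([distance_correlation(ranks(X[:, j]), ry)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]

rng = np.random.default_rng(0)
X = rng.standard_t(df=2, size=(200, 20))                 # heavy-tailed features
y = 2 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(200)
print(sorted(rdc_screen(X, y, 5).tolist()))              # keeps features 0 and 1
```

    After screening down to d features, any regularization method (Lasso, SCAD) can be run on the retained columns, matching the two-stage recipe described above.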

    Bayesian indicator variable selection of multivariate response with heterogeneous sparsity for multi-trait fine mapping

    Variable selection plays a critical role in contemporary statistics and scientific discovery. Numerous regularization and Bayesian variable selection methods have been developed over the past two decades, but they mainly target a single response. As more data are collected nowadays, it is common to obtain and analyze multiple correlated responses from the same study. Running a separate regression for each response ignores their correlation, so multivariate analysis is recommended. Existing multivariate methods select variables related to all responses without considering the possibly heterogeneous sparsity of different responses; that is, some features may predict only a subset of the responses. In this paper, we develop a novel Bayesian indicator variable selection method for multivariate regression with a large number of grouped predictors and multiple correlated responses with possibly heterogeneous sparsity patterns. The method is motivated by the multi-trait fine mapping problem in genetics: identifying the variants that are causal for multiple related traits. Our new method performs selection at the individual level, at the group level, and specific to each response. In addition, we propose the new concept of a subset posterior inclusion probability to prioritize predictors that target subset(s) of responses. Extensive simulations with varying sparsity, heterogeneity, and dimension show the advantage of our method in variable selection and prediction over existing general Bayesian multivariate variable selection methods and Bayesian fine mapping methods. We also applied our method to a real imaging genetics data set and identified important causal variants for brain white-matter structural change in different regions. (29 pages, 3 figures.)
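    The subset posterior inclusion probability can be illustrated with a toy computation, assuming the sampler returns binary inclusion indicators gamma[draw, predictor, response]. Reading the subset PIP as the posterior frequency with which a predictor is active for every response in the subset is our interpretation of the abstract, not a formula taken from the paper; all names here are illustrative.

```python
import numpy as np

# Hypothetical posterior draws of binary inclusion indicators:
# gamma[s, j, k] = 1 if predictor j is included for response k in MCMC draw s.
rng = np.random.default_rng(1)
n_draws, p, q = 1000, 4, 3
gamma = (rng.random((n_draws, p, q)) < 0.2).astype(int)
gamma[:, 0, :2] = 1            # predictor 0 always active for responses 0 and 1

def marginal_pip(gamma):
    """Posterior inclusion probability of each (predictor, response) pair."""
    return gamma.mean(axis=0)

def subset_pip(gamma, subset):
    """Posterior probability that a predictor is active for every response
    in `subset` (one plausible reading of the 'subset PIP' concept)."""
    return gamma[:, :, subset].all(axis=2).mean(axis=0)

print(marginal_pip(gamma)[0])       # predictor 0: [1.0, 1.0, ~0.2]
print(subset_pip(gamma, [0, 1]))    # predictor 0 near 1, the rest near 0.04
```

    A predictor with a high subset PIP for responses {0, 1} but a low marginal PIP for response 2 would be flagged as targeting exactly that subset of traits, which is the heterogeneous-sparsity pattern the method is designed to detect.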

    Differential expression and feature selection in the analysis of multiple omics studies

    With the rapid advance of high-throughput technologies in the past decades, various kinds of omics data have been generated by many labs and accumulated in the public domain. These studies were designed for different biological purposes, including the identification of differentially expressed genes and the selection of predictive biomarkers. Effective meta-analysis of omics data from multiple studies can improve the statistical power, accuracy, and reproducibility achievable in a single study. This dissertation covers methods for differential expression (Chapters 2 and 3) and feature selection (Chapter 4) in the analysis of multiple omics studies. In Chapter 2, we propose a fully Bayesian hierarchical model for RNA-seq meta-analysis that models count data, integrates information across genes and across studies, and models differential signals across studies via latent variables. A Dirichlet process mixture prior is further applied to the latent variables to categorize detected biomarkers according to their differential expression patterns across studies. We use both simulations and a real application to multiple brain regions of HIV-1 transgenic rats to demonstrate the improved sensitivity, accuracy, and biological findings of our method. In Chapter 3, we extend this Bayesian model to jointly integrate transcriptomic data from two platforms: microarray and RNA-seq. In Chapter 4, we consider a general framework for variable screening with multiple omics studies and propose a novel two-step screening procedure for high-dimensional regression analysis in this framework. Compared to a one-step procedure and rank-based sure independence screening, our procedure greatly reduces false negatives while keeping a low false positive rate. Theoretically, we show that the procedure possesses the sure screening property under weaker assumptions on signal strengths and allows the number of features to grow at an exponential rate in the sample size. Public health significance: the proposed methods are useful for detecting important biomarkers that are either differentially expressed or predictive of clinical outcomes, which is essential for finding potential drug targets and understanding disease mechanisms. Such findings in basic science can be translated into preventive medicine or potential treatments that promote human health and improve the global healthcare system.
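    The abstract does not spell out the two steps, so the sketch below is only one plausible reading: step 1 keeps a generous per-study candidate set by marginal correlation, and step 2 re-ranks the union of candidates by the average score across studies. The function names and the aggregation rule are illustrative assumptions, not the dissertation's actual procedure.

```python
import numpy as np

def marginal_scores(X, y):
    """Absolute marginal Pearson correlation of each column of X with y."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def two_step_screen(studies, d1, d2):
    """Hypothetical two-step screening across multiple studies:
    step 1 keeps a generous candidate set per study; step 2 re-ranks
    the union of candidates by the average score across studies."""
    candidates = set()
    for X, y in studies:
        s = marginal_scores(X, y)
        candidates |= set(np.argsort(s)[::-1][:d1].tolist())
    cand = sorted(candidates)
    avg = np.mean([marginal_scores(X, y)[cand] for X, y in studies], axis=0)
    return [cand[i] for i in np.argsort(avg)[::-1][:d2]]

# Three synthetic studies sharing the same two signal features (3 and 7).
rng = np.random.default_rng(2)
def make_study(n=150, p=50):
    X = rng.standard_normal((n, p))
    y = 1.5 * X[:, 3] + 1.5 * X[:, 7] + rng.standard_normal(n)
    return X, y

studies = [make_study() for _ in range(3)]
print(sorted(two_step_screen(studies, d1=10, d2=4)))   # includes 3 and 7
```

    Averaging scores across studies is what lets a shared signal survive even when it is only moderately ranked in one study, which matches the abstract's claim of fewer false negatives than single-study screening.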

    Comparing empirical kinship derived heritability for imaging genetics traits in the UK biobank and human connectome project

    Imaging genetics analyses use neuroimaging traits as intermediate phenotypes to infer the degree of genetic contribution to brain structure and function in health and illness. Coefficients of relatedness (CR) summarize the degree of genetic similarity among subjects and are used to estimate heritability, the proportion of phenotypic variance explained by genetic factors. The CR can be inferred directly from genome-wide genotype data to explain the degree of shared variation in common genetic polymorphisms (SNP-heritability) among related or unrelated subjects. We developed a central processing unit and graphics processing unit (CPU and GPU) accelerated Fast and Powerful Heritability Inference (FPHI) approach that linearizes the likelihood calculation to overcome the ~N^2–3 dependency of computational effort on sample size in classical likelihood approaches. We calculated heritability for 60 regional and 1.3 × 10^5 voxel-wise traits in N = 1,206 twin and sibling participants from the Human Connectome Project (HCP) (550 M/656 F, age = 28.8 ± 3.7 years) and N = 37,432 participants (17,531 M/19,901 F; age = 63.7 ± 7.5 years) from the UK Biobank (UKBB). The FPHI estimates were in excellent agreement with heritability values calculated using the Genome-wide Complex Trait Analysis software (r = 0.96 and 0.98 in the HCP and UKBB samples, respectively) while reducing computation time by a factor of 10^2–10^4. The regional and voxel-wise trait heritability estimates for the HCP and UKBB were likewise in excellent agreement (r = 0.63–0.76, p < 10^−10). In summary, the hardware-accelerated FPHI makes it practical to calculate heritability values for voxel-wise neuroimaging traits, even in very large samples such as the UKBB. The patterns of additive genetic variance in neuroimaging traits measured in a large sample of related and unrelated individuals showed excellent agreement regardless of the estimation method. The code and instructions to execute these analyses are available at www.solar-eclipse-genetics.org.
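    The abstract defines heritability as the proportion of phenotypic variance explained by genetic factors and notes that FPHI works by linearizing the likelihood calculation. As a hedged illustration of how a linearized, moment-based estimator can recover heritability from a kinship matrix, here is a minimal Haseman-Elston-style regression sketch; it is not the FPHI algorithm itself, only the general idea of trading the full likelihood for a linear fit.

```python
import numpy as np

def he_regression_h2(K, y):
    """Moment-based heritability estimate: regress off-diagonal phenotype
    cross-products y_i * y_j on the kinship entries K_ij (Haseman-Elston
    style). A linearized alternative to the full likelihood, not FPHI."""
    y = (y - y.mean()) / y.std()
    mask = ~np.eye(len(y), dtype=bool)            # off-diagonal pairs only
    k, cp = K[mask], np.outer(y, y)[mask]
    return (k @ cp) / (k @ k)

# Simulate 1,000 sibling pairs (kinship 0.5 within a pair) with h2 = 0.6,
# so the within-pair phenotype correlation is h2 * 0.5 = 0.3.
rng = np.random.default_rng(3)
n_pairs, h2 = 1000, 0.6
K = np.kron(np.eye(n_pairs), np.array([[1.0, 0.5], [0.5, 1.0]]))
L = np.linalg.cholesky(np.array([[1.0, 0.3], [0.3, 1.0]]))
y = (rng.standard_normal((n_pairs, 2)) @ L.T).ravel()
print(round(he_regression_h2(K, y), 2))           # close to the true 0.6
```

    Because the estimator is a single linear regression over pairs, it scales far better with N than iterative likelihood maximization, which is the same practical motivation the abstract gives for FPHI.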

    Age-Related Gene Expression in the Frontal Cortex Suggests Synaptic Function Changes in Specific Inhibitory Neuron Subtypes

    Genome-wide expression profiling of the human brain has revealed genes that are differentially expressed across the lifespan. Characterizing these genes adds to our understanding of both normal function and pathological conditions. Additionally, the specific cell types that contribute to the motor, sensory, and cognitive declines of aging remain unclear. Here we test whether age-related genes show higher expression in specific neural cell types. Our study leverages two sources of murine single-cell expression data and two sources of age-associations from large gene expression studies of postmortem human brain. We used nonparametric gene set analysis to test for age-related enrichment of genes associated with specific cell types, and we also restricted our analyses to specific Gene Ontology groups. Our primary analyses paired single-cell expression data from the mouse visual cortex with age-related human postmortem gene expression information from the orbitofrontal cortex. Additional pairings using data from the hippocampus, prefrontal cortex, somatosensory cortex, and blood were used to validate our findings and test their specificity. We found robust age-related up-regulation of genes that are highly expressed in oligodendrocytes and astrocytes, while genes highly expressed in layer 2/3 glutamatergic neurons were down-regulated with age. Genes not specific to any neural cell type were also down-regulated, possibly due to the bulk tissue source of the age-related genes. A Gene Ontology-driven dissection of the cell-type-enriched genes highlighted the strong down-regulation of genes involved in synaptic transmission and cell-cell signaling in the somatostatin (Sst) neuron subtype that expresses cyclin-dependent kinase 6 (Cdk6) and in the vasoactive intestinal peptide (Vip) neuron subtype expressing myosin binding protein C, slow type (Mybpc1). These findings provide new insight into cell-specific susceptibility to normal aging and suggest age-related synaptic changes in specific inhibitory neuron subtypes.

    The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016

    The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16 resulting from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. Considerable effort was devoted to two major challenges: unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents the sixteen research groups' shared view of the recent advances, major paradigm shifts, and common tool chains in speaker recognition as witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.

    Embedded System Education in Zhejiang University

    Abstract: Embedded systems are becoming more and more useful in every domain of society, so embedded system education in universities is an urgent requirement. The College of Computer Science of Zhejiang University has been working to improve its embedded system education. This paper describes the achievements Zhejiang University has made since 2002. Key-Words: Embedded System, Curriculum, Course reformation, Embedded Software

    CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU

    Ultra-high thread-level parallelism in modern GPUs usually generates numerous memory requests simultaneously, so there are always many memory requests waiting at each bank of the shared last-level cache (LLC; L2 in this paper) and of global memory. For global memory, various schedulers have been developed to adjust the request sequence, but little prior work has focused on the service order at the shared LLC. We observed that in many GPU applications requests queue at the LLC banks for service, which provides an opportunity to optimize the service order. By adjusting the order in which GPU memory requests are serviced, we can improve the schedulability of the streaming multiprocessors (SMs). We therefore propose a criticality-aware shared LLC request scheduling algorithm (CaLRS). The priority assigned to each memory request is central to CaLRS: the criticality of a warp is represented by the number of memory requests that originate from that warp but have not yet been serviced when a request arrives at the shared LLC bank. Experiments show that the proposed scheme effectively boosts SM schedulability by promoting the scheduling priority of memory requests with high criticality, and thereby indirectly improves GPU performance.
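    As a rough software model of the criticality rule described above (criticality of a warp = number of its still-unserviced requests), assuming a simple software queue rather than the paper's hardware scheduler; the representation of requests as (id, warp) pairs is purely illustrative.

```python
from collections import Counter

def calrs_order(requests):
    """Toy criticality-aware ordering of LLC bank requests.

    Each request is (request_id, warp_id). Following the abstract, a
    request's criticality is the number of still-unserviced requests
    from the same warp; higher criticality is served first, with ties
    broken by arrival order. An illustrative software model only.
    """
    pending = Counter(warp for _, warp in requests)
    served, waiting = [], list(requests)
    while waiting:
        # pick the waiting request whose warp has the most pending requests
        i = max(range(len(waiting)), key=lambda i: pending[waiting[i][1]])
        req_id, warp = waiting.pop(i)
        served.append(req_id)
        pending[warp] -= 1      # one fewer unserviced request for this warp
    return served

# Warp A has three outstanding requests, warp B has one: A's requests are
# more critical and jump the queue until the counts tie.
reqs = [(0, "B"), (1, "A"), (2, "A"), (3, "A")]
print(calrs_order(reqs))        # → [1, 2, 0, 3]
```

    The model makes the intuition concrete: serving the warp with the most outstanding requests first shortens the time until that whole warp can be rescheduled on its SM, which is the schedulability gain the abstract reports.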